UniPixel-3B is a unified multimodal large language model for pixel-level vision-language understanding, which can flexibly support various fine-grained tasks, including image/video segmentation, region understanding, and the novel PixelQA task. The model jointly requires object-centric reference, segmentation, and question answering in videos, achieving pixel-level visual reasoning capabilities.
Multimodal
Safetensors